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Abstract 

The linear regression model is widely used in empirical work in Economics, Statis¬ 
tics, and many other disciplines. Researchers often include many covariates in their 
linear model specification in an attempt to control for confounders. We give infer¬ 
ence methods that allow for many covariates and heteroskedasticity. Our results are 
obtained using high-dimensional approximations, where the number of included co¬ 
variates are allowed to grow as fast as the sample size. We find that all of the usual 
versions of Eicker-White heteroskedasticity consistent standard error estimators for 
linear models are inconsistent under this asymptotics. We then propose a new het¬ 
eroskedasticity consistent standard error formula that is fully automatic and robust to 
both (conditional) heteroskedasticity of unknown form and the inclusion of possibly 
many covariates. We apply our findings to three settings: parametric linear models 
with many covariates, linear panel models with many fixed effects, and semiparamet- 
ric semi-linear models with many technical regressors. Simulation evidence consistent 
with our theoretical results is also provided. The proposed methods are also illustrated 
with an empirical application. 
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1 Introduction 


A key goal in empirical work is to estimate the structural, causal, or treatment effect of some 
variable on an outcome of interest, such as the impact of a labor market policy on outcomes 
like earnings or employment. Since many variables measuring policies or interventions are 
not exogenous, researchers often employ observational methods to estimate their effects. 
One important method is based on assuming that the variable of interest can be taken as 
exogenous after controlling for a sufficiently large set of other factors or covariates. A major 
problem that empirical researchers face when employing selection-on-observables methods 
to estimate structural effects is the availability of many potential covariates. This problem 
has become even more pronounced in recent years because of the widespread availability of 
large (or high-dimensional) new data sets. 

Not only it is often the case that substantive discipline-specific theory (or intuition) will 
suggest a large set of variables that might be important, but also researchers usually prefer 
to include additional “technical” controls constructed using indicator variables, interactions, 
and other non-linear transformations of those variables. Therefore, many empirical studies 
include very many covariates in order to control for as broad array of confounders as possible. 
For example, it is common practice to include dummy variables for many potentially over¬ 
lapping groups based on age, cohort, geographic location, etc. Even when some controls are 
dropped after valid covariate selection (Belloni, Chernozhukov, and Hansen (2014)), many 
controls usually may remain in the final model specification. For example, Angrist and Hahn 
(2004) discuss when to include many covariates in treatment effect models. 

We present valid inference methods that explicitly account for the presence of possibly 
many controls in linear regression models under (conditional) heteroskedasticity. We consider 
the setting where the object of interest is /9 in a model of the form 

i = 1, . . . , 77 ,, (1) 

where Ui^n is a scalar outcome variable, is a regressor of small (i.e., fixed) dimension d. 
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„ is a vector of covariates of possibly “large” dimension Kn, and is an unobserved 
error term. Two important cases discussed in more detail below, are “flexible” paramet¬ 
ric modeling of controls via basis expansions such as higher-order powers and interactions 
(i.e., a series-based formulation of the partially linear regression model), and models with 
many dummy variables such as multi-way hxed effects and interactions thereof in panel data 
models. In both cases conducting OLS-based inference on f3 in (1) is straightforward when 
the error „ is homoskedastic and/or the dimension of the nuisance covariates is mod¬ 
eled as a vanishing fraction of the sample size. The latter modeling assumption, however, 
is inappropriate in applications with many dummy variables and does not deliver a good 
distributional approximation when many covariates are included. 

Motivated by the above observations, this paper studies the consequences of allowing the 
error in (1) to be (conditionally) heteroskedastic in a setting where the covariate is 
permitted to be high-dimensional in the sense that Kn is allowed, but not required, to be a 
non-vanishing fraction of the sample size. Our main purpose is to investigate the possibility 
of constructing heteroskedasticity-consistent variance estimators for the OLS estimator of [3 
in (1) without (necessarily) assuming any special structure on the part of the covariate 
We present two main results. First, we provide high-level sufficient conditions guaranteeing 
a valid Gaussian distributional approximation to the hnite sample distribution of the OLS 
estimator of (3, allowing for the dimension of the nuisance covariates to be “large” relative 
to the sample size (i.e., K^/n 0). Second, we characterize the large sample properties 
of a class of variance estimators, and use this characterization to obtain both negative and 
positive results. The negative hnding is that the Bicker-White estimator is inconsistent in 
general, as are popular variants of this estimator. The positive result gives conditions under 
which an alternative heteroskedasticity-robust variance estimator (described in more detail 
below) is consistent. The main condition needed for our constructive results is a high-level 
assumption on the nuisance covariates requiring in particular that their number be strictly 
less than half of the sample size. As a by-product, we also hnd that among the popular HC/c 
class of standard errors estimators for linear models, a variant of the HC3 estimator delivers 
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standard errors that are asymptotically upward biased in general. Thus, standard OLS 
inference employing HC3 standard errors will be asymptotically valid, albeit conservative, 
even in high-dimensional settings where the number of covariate Wj „ is large relative to the 
sample size, i.e., when Kn/n 0. 

Our results contribute to the already sizeable literature on heteroskedasticity-robust vari¬ 
ance estimators for linear regression models, a recent review of which is given by MacKinnon 
(2012). Important papers whose results are related to ours include White (1980), MacKinnon and White 
(1985), Wu (1986), Chesher and Jewitt (1987), Shao and Wu (1987), Chesher (1989), Cribari-Neto, Ferrari, 
(2000), Kauermann and Carroll (2001), Bera, Suprayitno, and Premaratne (2002), Stock and Watson 
(2008), Cribari-Neto and da Gloria A. Lima (2011), Muller (2013), and Abadie, Imbens, and Zheng 
(2014). In particular, Bera, Suprayitno, and Premaratne (2002) analyze some hnite sample 
properties of a variance estimator similar to the one whose asymptotic properties are stud¬ 
ied herein. They use unbiasedness or minimum norm quadratic unbiasedness to motivate 
a variance estimator that is similar in structure to ours, but their results are obtained for 
hxed Kn and n and is silent about the extent to which consistent variance estimation is even 
possible when Kn/n -f)- 0. 

This paper also adds to the literature on high-dimensional linear regression where the 
number of regressors grow with the sample size; see, e.g., Huber (1973), Koenker (1988), 

Mammen (1993), El Karoui, Bean, Bickel, Lim, and Yu (2013), Zheng, Jiang, Bai, and He 
(2014), Li and Muller (2017), and references therein. In particular, Huber (1973) showed 
that htted regression values are not asymptotically normal when the number of regres¬ 
sors grows as fast as sample size, while Mammen (1993) obtained asymptotic normality 
for arbitrary contrasts of OLS estimators in linear regression models where the dimen¬ 
sion of the covariates is at most a vanishing fraction of the sample size. More recently, 

El Karoui, Bean, Bickel, Lim, and Yu (2013) showed that, if a Gaussian distributional as¬ 
sumption on regressors and homoskedasticity is assumed, then certain estimated coefficients 
and contrasts in linear models are asymptotically normal when the number of regressors 
grow as fast as sample size, but do not discuss inference results (even under homoskedas- 
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ticity). Our result in Theorem 1 below shows that certain contrasts of OLS estimators in 
high-dimensional linear models are asymptotically normal under fairly general regularity 
conditions. Intuitively, we circumvent the problems associated with the lack of asymptotic 
Gaussianity in general high-dimensional linear models by focusing exclusively on a small 
subset of regressors when the number of covariates gets large. We give inference results by 
constructing heteroskedasticity consistent standard errors without imposing any distribu¬ 
tional assumption or other very specihc restrictions on the regressors. In particular, we do 
not require the coefficients 7 „ to be consistently estimated; in fact, they will not be in most 
of our examples discussed below. 

Our high-level conditions allow for Kn oc n and restrict the data generating process in 
fairly general and intuitive ways. In particular, our generic sufficient condition on the nui¬ 
sance covariates covers several special cases of interest for empirical work. For example, 
our results encompass (and weakens in certain sense) those reported in Stock and Watson 
(2008), who investigated the one-way hxed effects panel data regression model and showed 
that the conventional Eicker-White heteroskedasticity-robust variance estimator is inconsis¬ 
tent, being plagued by a non-negligible bias problem attributable to the presence of many 
covariates (i.e., the hxed effects). The very special structure of the covariates in the one-way 
hxed ehects estimator enables an explicit characterization of this bias, and also leads to a 
direct plug-in consistent bias-corrected version of the Eicker-White variance estimator. The 
generic variance estimator proposed herein essentially reduces to this bias-corrected variance 
estimator in the special case of the one-way hxed ehects model, even though our results are 
derived from a diherent perspective and generalize to other settings. 

Furthermore, our general inference results can be used when many multi-way hxed ehects 
and similar discrete covariates are introduced in a linear regression model, as it is usually 
the case in social interaction and network settings. For example, in a very recent contri¬ 
bution, Verdier (2017) develops new results for two-way hxed ehect design and projection 
matrices, and use them to verify our high-level conditions in linear models with two-way 
unobserved heterogeneity and sparsely matched data (which can also be interpreted as a 
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network setting). These results provide another interesting and empirically relevant illustra¬ 
tion of our generic theory. Verdier (2017) also develops inference results able to handle time 
series dependence in his specific context, which are not covered by our assumptions because 
we impose independence in the cross-sectional dimension of the (possibly grouped) data. 

The rest of this paper is organized as follows. Section 2 presents the variance estimators 
we study and gives a heuristic description of their main properties. Section 3 introduces 
our general framework, discusses high-level assumptions and illustrates the applicability of 
our methods using three leading examples. Section 4 gives the main results of the paper. 
Section 5 reports the results of a Monte Carlo experiment, while 6 illustrates our methods 
using an empirical application. Section 7 concludes. Proofs and additional methodological 
and numerical results are reported in the online supplemental appendix. 

2 Overview of Results 

For the purposes of discussing distribution theory and variance estimators associated with 
the OLS estimator /3„ of (3 in (1), when possibly the dimensional nuisance covariates 
Wi^n is of “large” dimension and/or the parameters 7 „ cannot be estimated consistently, it 
is convenient to write the estimator in “partialled out” form as 

n n n 

Pn — ( ^ ^ ^ ^i,nyi,n)i ^i,n — ^ ^ Mij,n^j,n^ 

i=l i=l j=l 

where = l(vi = j) — w' „(X]fc=i ^k,n^'k ]!(■) denotes the indicator function, and 

the relevant inverses are assumed to exist. Defining ^^e objective is 

to establish a valid Gaussian distributional approximation of the hnite sample distribution 
of the OLS estimator /3„, and then find an estimator of the variance of Ym=i \/n 

such that 

- /?) AA(0, I), (ln = f-'Snf(2) 
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in which case asymptotic valid inference on can be conducted in the usual way by employing 
the distributional approximation /3„ ~ A/'(/S, Cln/n). Our assumptions below will ensure that 
Pn remains -y/n-consistent because we show in the supplemental appendix that = Op(l) 
even when Kn/n 0. 

Our hrst result, Theorem 1 below, gives sufficient conditions for a valid Gaussian approxi¬ 
mation of the distribution of the infeasible statistic r2n^^^\/n(/3n—/9), where = f 
and denotes the variance of Yll=i^i,nUi,nl\/n^ even when possibly Kn/n 0 and the 
linear regression model exhibits conditional heteroskedasticity. This result, in turn, gives the 
basic ingredient for discussing valid variance estimation in high-dimensional linear regression 
models. Defining ^ij,n{yj,n — $'n^j,n), standard choices of in the fixed- 

case include the homoskedasticity-only estimator 



cr„ 


1 

n — d — Kn 


i=l 


and the Eicker-White-type estimator 



i=l 


Perhaps not too surprisingly, in Theorem 2 below, we find that consistency of under 
homoskedasticity holds quite generally even for models with many covariates. In contrast, 
construction of a heteroskedasticity-robust estimator of is more challenging, as it turns 
out that consistency of S™ generally requires to be a vanishing fraction of n. 

To fix ideas, suppose x'w'^„) are i.i.d. over i. It turns out that, under certain 
regularity conditions. 


^ n n 

i=l j=l 


'■Kn 


E[m 


2 

j,n 


^j,ni Wyn] T 
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whereas a requirement for (2) to hold is that the estimator satishes 


S 


n 


1 X \ 

- ^i,n] + Op(l). 

^ i=l 


( 3 ) 


The difference between the leading terms in the expansions is non-negligible in general unless 
Kn/n —)■ 0. In recognition of this problem with S™, we study the more general class of 
estimators of the form 

^ n n 

^n(f^n) = — y ^ y ^ 

i=l j=l 

where Hij^n denotes element (i, j) of a symmetric matrix Kn = K,n(y^i,n, ■ ■ ■, ^n,n)- Estimators 
that can be written in this fashion include S™ (which corresponds to Kn = In) as well as 
variants of the so-called HC/c estimators, k G {1, 2, 3,4}, reviewed by Long and Ervin (2000) 
and MacKinnon (2012), among many others. To be specihc, a natural variant of HC/c is 
obtained by choosing Kn to be diagonal with Ku^n = where (Tj_n,'Ci,n) = (1,0) for 

HCO (and corresponding to S™), = {n/{n - Kn),0) for HCl, (Ti,„,^i,„) = (1,1) 

for HC2, (Ti,n,6,n) = (1,2) for HC3, and = (1, min(4, nM^^^^/iC^)) for HC4. See 

Sections 4.3 for more details. 

In Theorem 3 below, we show that all of the HC/c-type estimators, which correspond to 
a diagonal choice of Kn, have the shortcoming that they do not satisfy (3) when Kn/n 0. 
On the other hand, it turns out that a certain non-diagonal choice of Kn makes it possible to 
satisfy (3) even if Kn is a non-vanishing fraction of n. To be specihc, it turns out that (under 
regularity conditions and) under mild conditions under the weights Kij^n, S„(fi:„) satishes 


Kr,, = 


n 


i=l j=l k=l 


Mkj,rSi 




E[m 


2 


Wj-n] + Op(l), 


(4) 


suggesting that (3) holds with = !]„(«:„) provided Kn is chosen in such a way that 


Kik,nMlj^^ = l{i = i), l<i,3 < n. (5) 

k=l 
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Accordingly, we define 


i:!'' = Sr 


- - 


EE 

1=1 j=i 


V- 


-/ -2 


where, with M„ denoting the matrix with element {i,j) given by Mij^n and © denoting the 
Hadamard product. 


= 


K 


K. 


.HC 

'll,n 

^ln,n 

.HC 

■ 

'^nn,n 


\ 


( 




V 


Mir. 



(M„0M„)-^ 






The estimator S^^ is well dehned whenever 0 is invertible, a simple sufficient con¬ 
dition for which is that M.n < 1/2, where 


Mn = l- min 

l<i<n 

The fact that A4.n <1/2 implies invertibility of Mn©Mn is a consequence of the Gershgorin 
circle theorem. For details, see Section 3 in the supplemental appendix. More importantly, 
a slight strengthening of the condition Ain <1/2 will be shown to be sufficient for (2) and 
(3) to hold with S„ = S^*^. Our hnal result. Theorem 4 below, formalizes this hnding (see 
also the supplemental appendix for further intuition underlying this result). 

The key intuition underlying our variance estimation result is that, even though each con¬ 
ditional variance Wj^„] cannot be well estimated due to the curse of dimensionality, 

an averaged version such as the leading term in (3) can be estimated consistently. Thus, 
taking W(a\n\^i,n,^i,n] = Y2=i ^ik,nul,n an estimator of E[uln\:>^i,n,^i,n], plugging into 
the leading term in (3), and computing conditional expectations, we obtain the leading term 
in (4). To make this leading term equal to the desired target 






it is natural to require 


Y ^j,n\ = 1 < i < U. 

j=l k=l 

Since E[M^^|xj Wj_„] are unknown, our variance estimator solves (5), which generates enough 
equations to solve for all n{n — l)/2 possibly distinct elements in . 


Remark 1. = n 




Er=i ^i,n^,nuln witli ^nd therefore can be 


interpreted as a bias-corrected “estimator” of (the conditional expectation of) 


3 Setup 

This section introduces a general framework encompassing several special cases of linear-in- 
parameters regression models of the form (1). We hrst present generic high-level assumptions, 
and then discuss their implications as well as some easier to verify sufficient conditions. Fi¬ 
nally, to close this setup section, we briefly discuss three motivating leading examples: linear 
regression models with increasing dimension, muti-way hxed effect linear models, and semi- 
parametric semi-linear regression. Technical details and related results for these examples 
are given in the supplemental appendix. 

3.1 Framework 

Suppose x'w' „) : 1 < i < n} is generated by (1). Let || ■ || denote the Euclidean 

norm, set Xn = (xi^„,..., x„^„), and for a collection Wn of random variables satisfying 
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IE[wj^„|Wn] = define the constants 


1 

0n = - Ri^n = E[Ui^n\Xn, Wn], 

i=l 

1 ” 

Pn = - '^E[rf J, Ti^n = E[Mi,„|>Vn], 

i=l 

1 ” 

Xn = - ^E[||Qi,„|p], Qi^n = E[Vj,„|W„], 
i=l 

where - (Xlj=i IE[xj-„w'. ^])(X;j=i E[wj'„w'is the population counterpart 

of Vj „. Also, dehne 

= niax{E[t/fjA„,>V„] +E[||V,,„f |>V„] + 1/E[17fj >V„]} + l/A„in(E[fj WJ)}, 

l<i<n 

where Ui -n yi,n E[|/j^^| W^], ^i,n E[xj ,j|V\Ai], T^j i n/^i and 

V- = V"" M - V ■ 

We impose the following three high-level conditions. Let Amin(-) denote the minimum 
eigenvalue of its argument, and hm„_,.ooan = limsup„_,.go for any sequence a„. 

Assumption 1 (Sampling) C[17i,„, Lj-n| Wn] = 0 for i ^ j and maxi<i<Ar„ = 

0(1), where ffTi^n is the cardinality of Ti^n and where {Ti^n : 1 < * < Nn} is a partition of 
{1,..., n} such that {{Ut^n, Vt^n) ■ t G 7i,n} are independent over i conditional on Wn- 

Assumption 2 (Design) P[Amin(Z]r=in) > 0] ^ hm„^ooA'„/n < 1, and Cn = 

0 ,( 1 ). 

Assumption 3 (Approximations) Xn = 0(1), Qn + n{gn — Pn) + nXnQn = o(l), and 

maxi<j<„ ||vi,„||/v^ = 0,(1). 

3.2 Discussion of Assumptions 

Assumptions 1-3 are meant to be high-level and general, allowing for different linear-in- 
parameters regression models. We now discuss the main restrictions imposed by these 
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assumptions. We further illustrate them in the following subsection using more specihc 
examples. 

3.2.1 Assumption 1 

This assumption concerns the sampling properties of the observed data. It generalizes clas¬ 
sical i.i.d. sampling by allowing for groups or “clusters” of hnite but possibly heterogeneous 
size with arbitrary intra-group dependence, which is very common in the context of hxed 
effects linear regression models. As currently stated, this assumption does not allow for 
dependence in the error terms across units, and therefore excludes clustered, spacial or time 
series dependence in the sample. We conjecture our main results extend to the latter cases, 
though here we focus on i.n.i.d. (conditionally) heteroskedastic models only, and hence rele¬ 
gate the extension to errors exhibiting clustered, spacial or time series dependence for future 
work. Assumption 1 reduces to classical i.i.d. sampling when Nn = n, Ti^n = {*} [implying 
maxi<j<Ar^ = 1], and all observations have the same distribution. 

3.2.2 Assumption 2 

This assumption concerns basic design features of the linear regression model. The hrst two 
restrictions are mild and reflect the main goal of this paper, that is, analyzing linear regression 
models with many nuisance covariates Wj „. In practice, the hrst restriction regarding the 
minimum eigenvalue of the design matrix n is always imposed by removing 

redundant (i.e., linearly dependent) covariates; from a theoretical perspective this condition 
requires either restrictions on the distributional relationship of such covariates or some form 
of trimming leading to selection of included covariates (e.g., most software packages remove 
covariates leading to “too” small eigenvalues of the design matrix by means of some hard- 
thresholding rule). 

On the other hand, the last condition, Cn = Op{l), may be restrictive in some settings: for 
example, if the covariates have unbounded support (e.g., they are normally distributed) and 
heteroskedasticity is unbounded (e.g., unbounded multiplicative heteroskedasticity), then 
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the assumption may fail. Simple sufficient conditions for = Op(l) can be formulated 
when the covariates have compact support, or the heteroskedasticity is multiplicative and 
bounded, because in these cases it is easy to bound the conditional moments of the error 
terms. It would be useful to know whether the condition = Op(l) can be relaxed to a 
version involving only unconditional moments, though we conjecture this weaker assumption 
will require a different method of proof (see the supplemental appendix for details). 

3.2.3 Assumption 3 

This assumption requires two basic approximations to hold. First, concerning bias, condi¬ 
tions on Qn are related to the approximation quality of the linear-in-parameters model (1) 
for the “long” conditional expectation Wn]. Similarly, conditions on and Xn 

are related to linear-in-parameters approximations for the “short” conditional expectations 
lE[|/i,n|hVn] and E[xj„|>V„], respectively. All these approximations are measured in terms 
of population mean square error, and are at the heart of empirical work employing linear- 
in-parameters regression models. Depending on the model of interest, different sufficient 
conditions can be given for these assumptions. Here we briefly mention the most simple one: 
(a) if E[Mj >V„] = 0 for all i and n, which can be interpreted as exogeneity (e.g., no 
misspecihcation bias), then 0 = = u(p„ — Pn) +'nXnQn for all u; and (b) if E[||xj^„|p] < cx) 

for all i and n, then Xn = 0{1). Other sufficient conditions are discussed below. 

Second, the high-level condition maxi<j<„ ||vi^„||/y^ = Op(l) restricts the distributional 
relationship between the finite dimensional covariate of interest Xj^ji and the high-dimensional 
nuisance covariate Wj This condition can be interpreted as a negligibility condition and 
thus comes close to minimal for the central limit theorem to hold. At the present level of 
generality it seems difficult to formulate primitive sufficient conditions for this restriction 
that cover all cases of interest, but for completeness we mention that under mild moment 
conditions it suffices to require that one of the following conditions hold (see Lemma SA-7 
in the supplemental appendix for details and weaker conditions): 

(i) Mn = Op(l), or 
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(ii) Xn = o(l), or 

(iii) maxi<i<„ 7^ 0) = Op{n^/^). 

Each of these conditions is interpretable. First, Ain > Kn/n because Yl^=i ^ii,n = n—Kn 
and a necessary condition for (i) is therefore that Kn/n —)■ 0. Conversely, because 

. / , f min]^<j<^ Mu n 

Mn <- 

n 1 - niaxi<i<„ Mu^n 

the condition Kn/n —)■ 0 is sufficient for (i) whenever the design is “approximately balanced” 
in the sense that (1 — mini<j<„Mjj_„)/(l — maxi<j<„= Op(l). In other words, (i) 
requires and effectively covers the case where it is assumed that Kn is a vanishing fraction 
of n. In contrast, conditions (ii) and (iii) can hold also when Kn is a non-vanishing fraction 
of n, which is the case of primary interest in this paper. 

Because (ii) is a requirement on the accuracy of the approximation E[xj „|wj^„] ^'n^i,n 

with 6n = E[wi^„w'„], primitive conditions for it are available when, for exam¬ 
ple, the elements of Wj „ are approximating functions. Indeed, in such cases one typically 
has Xn = 0{K~'^) for some a > 0, so condition (ii) not only accommodates Kn/n 0, but 
actually places no upper bound on the magnitude of Kn in important special cases. This 
condition also holds when Wj are dummy variables or discrete covariates, as we discuss in 
more detail below. 

Finally, condition (iii), and its underlying higher-level condition described in the supple¬ 
mental appendix, is useful to handle cases where Wj „ cannot be interpreted as approximating 
functions, but rather just many different covariates included in the linear model specihca- 
tion. This condition is a “sparsity” condition on the projection matrix M„, which allows 
for Kn/n 0. The condition is easy to verify in certain cases, including those where “lo¬ 
cally bounded” approximating functions or fixed effects are used (see below for concrete 
examples). 
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3.3 Motivating Examples 


We briefly mention three motivating examples of linear-in-parameter regression models cov¬ 
ered by our results. All technical details are given in the supplemental appendix. 

3.3.1 Linear Regression Model with Increasing Dimension 

This leading example has a long tradition in statistics and econometrics. The model takes (1) 
as the data generating process (DGP), typically with i.i.d. data and the exogeneity condition 
= 0. However, our assumptions only require = o(l), 

and hence (1) can be interpreted as a linear-in-parameters mean-square approximation to 
the unknown conditional expectation E[|/j,i|xjWj^„]. Either way, /3„ is the standard OLS 
estimator. 

Setting Wn = (wi,„,..., Nn = n, 71, „ = {*} and maxi<i<Ar^ #71, n = 1, Assump¬ 
tions 1-2 are standard, while Assumption 3 is satished provided that E[||xj,„||^] < oo [im¬ 
plying Xn = 0(1)], nE[(E[Mi,„|xi,„, Wi,J)2] = o(l) [implying n{Qn - Pn) + nXnQn = o(l)], 
and maxi<j<„ ||vj,„||/#n = Op(l). Primitive sufficient conditions for the latter negligibil¬ 
ity condition can be given as discussed above. For example, under regularity conditions, 
Xn = o(l) if either (a) E[xj,„|wj,„] = 6'wi^n, {b) the nuisance covariates are discrete and a 
saturated dummy variables model is used, or (c) Wj,„ are constructed using sieve functions. 
Alternatively, maxi<j<„ YTj=i # 0) = Op{n}/‘^) is satished provided the distribution of 

the nuisance covariates Wj,„ generates a projection matrix that is approximately a band 
matrix (see below for concrete examples). Precise regularity conditions for this example are 
given in Section 4.1 of the supplemental appendix. 

3.3.2 Fixed Effects Panel Data Regression Model 

A second class of examples covered by our results are linear panel data models with multi¬ 
way hxed effects and related models such as those encountered in networks, spillovers or 
social interactions settings. A common feature in these examples is the presence of possibly 
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many dummy variables in Wj , 1 , capturing unobserved heterogeneity or other unobserved ef¬ 
fects across units (e.g., network link or spillover effect). In many applications the number 
of distinct dummy-type variables is large because researchers often include multi-group in¬ 
dicators, interactions thereof, and similar regressors obtained from factor variables. In these 
complicated models the nuisance covariates need to be estimated explicitly, even in simple 
linear regression problems, because it is not possible to difference out the multi-way indicator 
variables for estimation and inference. 

Stock and Watson (2008) consider heteroskedasticity-robust inference for the one-way 
hxed effect panel data regression model 

Y^it = Oii + + Uit, i = 1,..., N, t = 1,... ,T, (6) 

where Oj G M is an individual-specihc intercept, is a regressor of dimension d, and Uu 
is an scalar error term, and the following assumptions are satished. To map this model 
into our framework, suppose that {{Un ,..., Uix, ..., X'j,) ; 1 < i < n} are independent 
over i, E[f/ji|Xji..., Xj-r] = 0, and ..., = 0 for f 7 ^ s. Then, setting 

~ XT, Kn = N, — (o^l) • • • ) ®v) ) and ~ 

1 < i < N and 1 < t < T, where is the i-th unit vector of 

dimension N, the model ( 6 ) is also of the form (1) and f3n is the hxed effects estimator of 
f3. In general, this model does not satisfy an i.i.d. assumption, but Assumption 1 enables 
us to employ results for independent random variables when developing asymptotics. In 
particular, unlike Stock and Watson (2008), we do not require {Un ,..., ... ,X' 2 .) 

to be i.i.d. over i, nor we require any kind of stationarity on the part of (f/jt,X'J. The 
amount of variance heterogeneity permitted is quite large, since we basically only require 
V[yit|Xji,..., Xjt] = lE[Tj^|Xii,..., Xj-r] to be bounded and bounded away from zero. (On 
the other hand, serial correlation is assumed away because our assumptions imply that 
C[Yit, yj<j|Xji,..., Xjr] = 0 for t 7 ^ s.) In other respects this model is in fact more tractable 
than the previous models due to the special nature of the covariates Wj that is, a dummy 
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variable for each unit i = 1,..., N. 

In this one-way fixed effects example, Kn/n = 1/T and therefore a high-dimensional 
model corresponds to a short panel model; maxi<j<„ 7 ^ 0) = T and hence the 

negligibility condition holds easily. If T > 2, our asymptotic Gaussian approximation for the 
distribution of the least-squares estimator /3„ is valid (see Theorem 1), despite the coefficients 
7 „ not being consistently estimated. On the other hand, consistency of our generic variance 
estimator requires T > 3 [implying Kn/n < 1/2]; see Theorems 3 and 4. Further details 
are given in Section 4.2 of the supplemental appendix, where we also discuss a case-specific 
consistent variance estimator when T = 2. 

Our generic results go beyond one-way fixed effect linear regression models, as they 
can be used to obtain valid inference in other contexts where multi-way fixed effects or 
similar discrete regressors are included. For a second concrete example, consider the recent 
work of Verdier (2017, and references therein) in the context of linear models with two- 
way unobserved heterogeneity and sparsely matched data. This model is isomorphic to a 
network model, where students and teacher (or workers and firms, for another example) are 
“matched” or “connected” over time, but potential unobserved heterogeneity at both levels is 
a concern. In this setting, under random sampling, Verdier (2017) offers primitive conditions 
for our high-level assumptions when two-way hxed effect models are used for estimation and 
inference. In particular, using a clever Markov chain argument (see his Lemma 1), he is 
able to provide different restriction on T and the number of matches in the network to 
ensure consistent variance estimation using the methods developed in this paper. To give 
one concrete example, he hnds that if T > 5 and for any pair of teachers (hrms), the number 
of students (workers) assigned to both teachers (hrms) in the pair is either zero or greater 
than three, then our key high-level condition in Theorem 4 below is verihed. 
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3.3.3 Semiparametric Partially Linear Model 


Another model covered by our results is the partially linear model 

yi = (3'yii +g{7.i) + Ei, i = (7) 

where Xj and Zj are explanatory variables, e, is an error term satisfying E[ej|xj,Zj] = 0 , 
the function giz) is unknown, and sampling is i.i.d. across i is assumed. Suppose {p^(z) : 
k = 1 , 2 , • • •} are functions having the property that linear combinations can approximate 
square-integrable functions of z well, in which case (^(zj) 7 ^p„(zi) for some 7 ^, where 

Pn(z) = (p^(z),... ,p^"(z))'. Defining yi^n = yi, ^i,n = Xj, = Pn(zi), and = 

Ei + g{zi) — 7 ^Wj the model (7) is of the form (1), and is the series estimator of /3; see, 
e.g., Donald and Newey (1994) and Cattaneo, Jansson, and Newey (2017) and references 
therein. 

Constructing the basis Pn(zi) in applications may require using a large either because 
the underlying functions are not smooth enough or because dim(zj) is large. For example, 
if a p = 3 cubic polynomial expansion is used, also known as a power series of order 3, then 
dim(wj) = (p + dim(zi))!/(p! dim(zj)!) = 286 when dim(zj) = 10 , and therefore flexible esti¬ 
mation and inference using the semi-linear model (7) with a sample size of n = 1, 000 gives 
Kn/n = 0.286. For further technical details on series-based methods see, e.g., Newey (1997), 
Chen (2007), Cattaneo and Farrell (2013), and Belloni, Chernozhukov, Chetverikov, and Kato 
(2015), and references therein. For another example, when the basis functions Pn(z) are con¬ 
structed using partitioning estimators, the OLS estimator of {3 becomes a subclassihcation es¬ 
timator, a method that has been proposed in the literature on program evaluation and treat¬ 
ment effects; see, e.g., Cochran (1968), Rosenbaum and Rubin (1983), Cattaneo and Farrell 
(2011), and references therein. When a Partitioning estimator of order 0 is used, the semi- 
linear model becomes a one-way fixed effects linear regression model, where each dummy 
variable corresponds to one (disjoint) partition on the support of zp in this case, is to 
the number of partitions or hxed effects included in the estimation. 
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Our primitive regularity conditions for this example include 


Pn = mm 

7GR^’' 


E[|^(Zi) -7'Pn(Zi)n = 0(1), Xn = 


mm 


E[||E[xj 


<5'p„(z,)|p] = 0(l), 


npnXn = o(l), and the negligibility condition maxi<j<„ ||vj_„||/>/n = Op(l). A key finding 
implied by these regularity conditions is that we only require minimal smoothness conditions 
on g{zi) and E[xj|zj]. The negligibility condition is automatically satished if Xn = o(l), 
as discussed above, but in fact our results do not require any approximation of E[xj|zj], as 
usually assumed in the literature, provided a “locally supported” basis is used; i.e., any basis 
p„(z) that generates an approximately band projection matrix M„; examples of such basis 
include partitioning and spline estimators. See Section 4.3 in the supplemental appendix for 
further discussion and technical details. 


4 Results 

This section presents our main theoretical results for inference in linear regression mod¬ 
els with many covariates and heteroskedasticity. Mathematical proofs, and other technical 
results that may be of independent interest, are given in the supplemental appendix. 

4.1 Asymptotic Normality 

As a means to the end of establishing (2), we give an asymptotic normality result for 
which may be of interest in its own right. 

Theorem 1 Suppose Assumptions 1-3 hold. Then, 


where 


E n ^ ^ / 

i=l 


E[Ul\A:^,Wr.]/ 


n. 
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In the literature on high-dimensional linear models, Mammen (1993) obtains a similar 
asymptotic normality result as in Theorem 1 but under the condition /n —)■ 0 for 
(5 > 0 restricted by certain moment condition on the covariates. In contrast, our result 
only requires Ximn^ooKn/n < 1 , but imposes a different restriction on the high-dimensional 
covariates (e.g., condition (i), (ii) or (iii) discussed previously) and furthermore exploits 
the fact that the parameter of interest is given by the hrst d coordinates of the vector 
(/?', (i.e., in Mammen (1993) notation, it considers the case c = (d,0')' with l denoting 

a d-dimensional vector of ones and 0 denoting a i^„-dimensional vector of zeros). 

In isolation, the fact that Theorem 1 removes the requirement Kn/n —)■ 0 may seem like 
little more than a subtle technical improvement over results currently available. It should 
be recognized, however, that conducting inference turn out to be considerably harder when 
^ri/n 74 0. The latter is an important insight about large-dimensional models that cannot 
be deduced from results obtained under the assumption Kn/n —)■ 0 , but can be obtained with 
the help of Theorem 1. In addition, it is worth mentioning that Theorem 1 is a substantial 
improvement over Cattaneo, Jansson, and Newey (2017, Theorem 1) because here it is not 
required that Kn —00 nor Xn = o(l) —a different method of proof is also used. This 
improvement applies not only to the partially linear model example, but more generally to 
linear models with many covariates, because Theorem 1 applies to quite general form of 
nuisance covariate beyond specihc approximating basis functions. In the specihc case of 
the partially linear model, this implies that we are able to weaken smoothness assumptions 
(or the curse of dimensionality), otherwise required to satisfy the condition Xn = o(l). 

Remark 2. Theorem 1 concerns only distributional properties of /3„. First, this theorem 
implies -^-consistency of j3n because = Op(l) (see Lemmas SA-1 and SA-2 of the 
supplemental appendix). Second, this theorem does require nor imply consistency of 
the (implicit) least squares estimate of 7 „, as in fact such a result will not be true in 
most applications with many nuisance covariates For example, in a partially linear 
model (7) the approximating coefficients Xn will not be consistently estimated unless 
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Kn/n —)■ 0, or in a one-way fixed effect panel data model (6) the unit-specific coefficients 
in 7„ will not be consistently estimated unless Kn/n = 1/T ^ 0. Nevertheless, 
Theorem 1 shows that Pn can still be root-n asymptotically normal under fairly general 
conditions; this result is due to the intrinsic linearity and additive separability of the 
model (1). 


4.2 Variance Estimation 

Achieving (2), the counterpart of (8) in which the unknown matrix is replaced by the es¬ 
timator requires additional assumptions. One possibility is to impose homoskedasticity. 

Theorem 2 Suppose the assumptions of Theorem 1 hold. //Wn] = cr^, then (2) 
holds with 

This result shows in quite some generality that homoskedastic inference in linear models 
remains valid even when Kn is proportional to n, provided the variance estimator incorpo¬ 
rates a degrees-of-freedom correction, as S™ does. 

Establishing (2) is also possible when Kn is assumed to be a vanishing fraction of n, 
as is of course the case in the usual hxed-A'n linear regression model setup. The following 
theorem establishes consistency of the conventional standard error estimator S™ under the 
assumption 0, and also derives an asymptotic representation for estimators of the 

form without imposing this assumption, which is useful to study the asymptotic 

properties of other members of the HCfc class of standard error estimators. 


Theorem 3 Suppose the assumptions of Theorem 1 hold. 

(a) If Kin -^p 0, then (2) holds with = S™. 

(b) If IlKnIloo = maxi<i<„ \f^ij,n\ = Op(l), then 


Kn = 


^ kVn] + Op{l). 


i=l j=l k=l 
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The conclusion of part (a) typically fails when the condition Kn/n —)■ 0 is dropped. For 
example, when specialized to K,n = In part (b) implies that in the homoskedastic case (i.e., 
when the assumptions of Theorem 2 are satisfied) the standard estimator S™ is asymptot¬ 
ically downward biased in general (unless Kn/n —)■ 0). In the following section we make 
this result precise and discuss similar results for other popular variants of the HC/c standard 
error estimators mentioned above. 

On the other hand, because J2i<k<n '^lk,n^kj,n = I(* = j) by construction, part (b) 
implies that is consistent provided ||k|)^|1oo = Op{l). A simple condition for this to occur 
can be stated in terms of Ain- Indeed, if Ain <1/2, then is diagonally dominant and it 
follows from Theorem 1 of Varah (1975) that 

As a consequence, we obtain the following theorem, whose conditions can hold even if 
Kn/n 0. 

Theorem 4 Suppose the assumptions of Theorem 1 hold. 

If^lAin < 1/2] ^ 1 and if l/{l/2 — Ain) = Op{l), then (2) holds with 

Because Ain > Kn/n, a necessary condition for Theorem 4 to be applicable is that 
hm„^ooA'„/n < 1/2. When the design is balanced, that is, when Mu^n = ... = Mnn,n (as 
occurs in the panel data model (6)), the condition hm„_,.ooA'„/?7, < 1/2 is also sufficient, but 
in general it seems difficult to formulate primitive sufficient conditions for the assumption 
made about A4„ in Theorem 4. In practice, the fact that Ain is observed means that the 
condition Ain < 1/2 is verifiable, and therefore unless Ain is found to be “close” to 1/2 
there is reason to expect to perform well. 

Remark 3. Our main results for linear models concern large-sample approximations for 
the finite-sample distribution of the usual t-statistics. An alternative, equally auto¬ 
matic approach is to employ the bootstrap and closely related resampling procedures 
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(see, among others, Freedman (1981), Mammen (1993), Gonalvez and White (2005), 
Kline and Santos (2012)). Assuming Kn/n 0, Bickel and Freedman (1983) demon¬ 
strated an invalidity result for the bootstrap. We conjecture that similar results can be 
obtained for other resampling procedures. Furthermore, we also conjecture that em¬ 
ploying appropriate resampling methods on the “bias-corrected” residuals uj^ (Remark 
1) can lead to valid inference procedures. Investigating these conjectures, however, is 
beyond the scope of this paper. Following the recommendation of a reviewer, we 
explored the numerical performance of the standard nonparametric bootstrap in our 
simulation study, where we found that indeed bootstrap validity seems to fail in the 
high-dimensional settings we considered. 

4.3 HC/c Standard Errors with Many Covariates 

The HC/c variance estimators are very popular in empirical work, and in our context are 
of the form Sn(Kn) with for some choice of (Tj_„,^i^„). See 

Long and Ervin (2000) and MacKinnon (2012) for reviews. Theorem 3(b) can be used to 
formulate conditions, including Kn/n —)■ 0, under which these estimators are consistent in 
the sense that 


^ r). ( ) 


Sn + Op(l), E[t/^ |dG,Wn]. 

n 


i=l 


More generally. Theorem 3(b) shows that, if Kij^n = ll(* = j)^, then 


Sn(^n) — Sr 


(^n) T ^p(f)) Sjj(k^) 


n 


EEt 

r=l j=l 






E[f/? 


We therefore obtain the following (mostly negative) results about the properties of HC/c 
estimators when Kn/n 0, that is, when potentially many covariates are included. 
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HCO: = (1,0). \i¥.[UUX^,Wn] = al then 


- — y'(l - „ < S„, 

n 

2=1 


with n ^ 7^ Op(l) in general (unless 0). Thus, !]„(«„) = 

S™ is inconsistent in general. In particular, inference based on S™ is asymptotically 
liberal (even) under homoskedasticity. 

HCl: (T,,„, e.,n) = (n/(n - K^), 0). If E[172jdf„, Wn] = and if Mn,„ = ... = then 

^ni^n) = but in general this estimator is inconsistent when Kn/n ^ 0 (and so is 
any other scalar multiple of S™). 

HC2: = (1,1). If E[17|^|T’„,>V„] = al, then S„(k„) = S„, but in general this 

estimator is inconsistent under heteroskedasticity when Kn/n ^ 0. For instance, if 
d = 1 and if E[f/|^|A’„, Wn] = then 


^n(^n) 


n 


EE 

1=1 j=i 


M2 


-1 

jj,n 


)-l{t= 3)]vljl^ ^ Op(l) 


in general (unless Kn/n —)■ 0). 


HC3: {/Ti^n)ii,n) = (1,2). Inference based on this estimator is asymptotically conservative 
because 


n n 


S„(fi;„) - ^ M.._2M2._„Vi,„v'>VJ > 0, 


i=l j=l,j^i 


where n ^ Er=i ^ Op(l) in general (unless 

Kn/n -)■ 0 ). 


HC4: (Ti,n,Ci,n) = (l,min(4,nMii,n/iF„)). If Mu^n = ... = M„„,„ = 2/3 (as occurs when 
T = 3 in the hxed effects panel data model), then HC4 reduces to HC3, so this 
estimator is also inconsistent in general. 
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Among other things these results show that (asymptotically) conservative inference in 
linear models with many covariates (i.e., even when K/n yA 0) can be conducted using 
standard linear methods (and software), provided the HC3 standard errors are used. 

In the numerical work reported in the following sections and the supplemental appendix, 
we present evidence comparing all these standard error estimators. In particular, we find 
that indeed standard OLS-based conhdence intervals employing HC3 standard errors are 
always quite conservative. Furthermore, we also hnd that our proposed variance estimator 
delivers confidence intervals with close-to-correct empirical coverage. 

5 Simulations 

We conducted a simulation study to assess the finite sample properties of our proposed 
inference methods as well as those of other standard inference methods available in the 
literature. Based on the generic linear regression model (1), we consider 15 distinct data 
generating processes (DGPs) motivated by the three examples discussed above. To conserve 
space, here we only discuss results from Model 1, a representative case, but the supplemental 
appendix contains the full set of results and further details (see Table 1 in the supplement 
for a synopsis of the DGPs used). 

We discuss results for a linear model (1) with i.i.d. data, n = 700, d = 1 and Xi^n ~ 
Normal(0, 1), = l(vj^„ > 2.5) with ~ Normal(0, Ia'^), and Ui^n ~ Normal(0, 1), all 

independent of each other. Thus, this design considers (possibly overlapping) sparse dummy 
variables entering each column assigns a value of 1 to approximately hve units out of u = 
700. We set /3 = 1 and = 0, and considered hve different model dimensions: dim(wj_„) = 
Kn G {1,71,141,211,281}. In the supplemental appendix we also present results for more 
sparse dummy variables in the context of one-way and two-way linear panel data regression 
models, and for non-binary covariates Wj „ in both increasing dimension linear regression 
settings and semiparametric partially linear regression settings (where 7^ 0 and is 
constructed using power series expansions). Furthermore, we also consider an asymmetric 
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and a bimodal distribution for the unobservable error terms. In all cases the numerical 


results are qualitatively similar to those discussed herein. For each DGP, we investigate both 
homoskedastic as well as (conditional on Xi^n and/or Wj „) heteroskedastic models, following 
closely the specihcations in Stock and Watson (2008) and MacKinnon (2012). In particular, 
our heteroskedastic model takes the form: = x„(l + (t{xi^n) + 

and = x^(l + (dwj^„)^), where the constants x„ and x„ are chosen so that 

= 1, and t{a) = al{—2 < a < 2) + 2sgn(a)(l — l(—2 < a < 2)). 

We conducted S = 5, 000 simulations to study the hnite sample performance of 16 con- 
hdence intervals: eight based on a Gaussian approximation and eight based on a bootstrap 
approximation. Our paper offers theory for Gaussian-based inference methods, but we also 
included bootstrap-based inference methods for completeness (as discussed in Remark 3, the 
bootstrap is invalid when Kn oc n in linear regression models). For each inference method, 
we report both average coverage frequency and interval length of 95% nominal conhdence 
intervals; the latter provides a summary of efficiency/power for each inference method. To 
be more specihc, for a = 0.05, the conhdence intervals take the form: 







where qjl = q^^{a) and qi{a) denotes a cumulative distribution function, and with 
i G {HOO, HOI, HGO, HGl, HC2, HC3, HC4, HCiF} corresponds the variance estimators dis¬ 
cussed in Sections 2 and 4.3. Gaussian-based methods set q{a) equal to the standard normal 
distribution for all i, while bootstrap-based methods are based on the nonparametric boot¬ 
strap distributional approximation to the distribution of the t-test = (/3„ — (3)/\J(ln,e/n. 
The empirical coverage of these 16 conhdence intervals are reported in Panel (a) of Table 
1. In addition. Panel (b) of Table 1 reports the average interval length of each conhdence 
intervals, which is computed as U = {q^l_a /2 ~ ' \/which ohers a summary of 

hnite sample power/efficiency of each inference method. 

The main hndings from the simulation study are in line with our theoretical results. To 
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be precise, we find that the confidence interval estimators constructed using our proposed 
standard errors formula denoted B.CK, offer close-to-correct empirical coverage. The 
alternative heteroskedasticity consistent standard errors currently available in the literature 
lead to confidence intervals that could deliver substantial under or over coverage depending on 
the design and degree of heteroskedasticity considered. We also find that inference based on 
HC3 standard errors is conservative, a general asymptotic result that is formally established 
in this paper. Bootstrap-based methods seem to perform better than their Gaussian-based 
counterparts, but they never outperform our proposed Gaussian-based inference procedure 
nor do they provide close-to-correct empirical coverage across all cases. Finally, our proposed 
confidence intervals also exhibit very good average interval length. 

6 Empirical Illustration 

We illustrate the different linear regression inference methods discussed in this paper using a 
real data set to study the effect of ability on earnings. In particular, we employ the dataset 
constructed by Garneiro, Heckman, and Vytlacil (2011, GHV, hereafter). [The dataset is 
available at https: //www. aeaweb. org/articles?id=10.1257/aer .101.6.2754.]. The data 
comes from the 1979 National Longitudinal Survey of Youth (NLSY79), which surveys indi¬ 
viduals born in 1957-1964 and includes basic demographic, economic and educational infor¬ 
mation for each individual. It also includes a well-known proxy for ability (beyond schooling 
and work experience): the Armed Forces Qualification Test (AFQT), which gives a measure 
usually understood as a proxy for their intrinsic ability for the respondent. This data has 
been used repeatedly to either control for or estimate the effects of ability in empirical studies 
in economics and other disciplines. See GHV for further details and references. 

The sample is composed of white males of ages between 28 and 34 years of old in 1991, at 
most 5 siblings, and with at least incomplete secondary education. We split the sample into 
individuals with high school dropouts and high school graduates, and individuals with some 
college, college graduates, and postgraduates. For each subsample, we consider the linear 
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regression model (1) with yt^n = log(wagesJ, where wages- is the log wage in 1991 of unit 
i, Xi^ri = af qt^ denotes the (adjusted) standardized AFQT score for unit i, and collects 
several survey, geographic and dummy variables for unit i. In particular, Wj „ includes the 14 
covariates described in CHV (Table 2, p. 2763), a dummy variable for wether the education 
level was completed, eight cohort hxed effects, county hxed effects, and cohort-county hxed 
effects. For our illustration, we further restrict the sample to units in counties with at least 
3 survey respondents, giving a total of Kn = 122 and n = 436 {Kn/n = 0.280, = 0.422) 

for high school educated units and = 123 and n = 452 {Kn/n = 0.272, M.n = 0.411) 
college educated units. 

The empirical hndings are reported in Table 2. For high school educated individuals, we 
hnd an estimated returns to ability of /? = 0.060. The statistical signihcance of this effect, 
however, depends on the inference method employed. If homoskedastic consistent standard 
errors are used, then the effect is statistical signihcant (p-values are 0.010 and 0.029 for unad¬ 
justed and degrees-of-freedom adjusted standard errors, respectively). If heteroskedasticity 
consistent standard errors are used, the default method in most empirical studies, then the 
statistical signihcance depends on the which inference method is used; see Section 4.3. In 
particular, HCO also gives a statistically signihcant result (p-value is 0.020), while HCl and 
HC2 deliver marginal signihcance (both p-values are 0.048). On the other hand, HC3 and 
HC4 give p-values of 0.092 and 0.122, respectively, and hence suggest that the point estimate 
is not statistically distinguishable from zero. Finally, our proposed standard error, HCiF, 
gives a p-value of 0.058, also making /3 = 0.060 statistically insignihcant at the conventional 
5-pecent level. In contrast, for college educated individuals, we hnd an ehect of = 0.091, 
and all inference methods indicate that this estimated returns to ability is statistically sig¬ 
nihcant at conventional levels. In particular, HC3 and our proposed standard errors HCiF 
give p-values of 0.037 and 0.017, respectively. 

This illustrative empirical application showcases the role of our proposed inference method 
for empirical work employing linear regression with possibly many covariates; in this applica¬ 
tion, Kn large relative to n {Kn/n ^ 0.3 ) is quite natural due to the presence of many county 
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and cohort fixed effects. Specifically, when stndying the effect of ability on earnings for high 
school educated individuals, the statistical significance of the results crucially depend on the 
inference methods used: as predicted by our theoretical findings, inference methods that are 
not robust to the inclusion of many covariates tend to deliver statistically significant results, 
while methods that are robust (HC3 is asymptotically conservative and YiCK is asymptot¬ 
ically correct) do not deliver statistically significant results, giving an example where the 
empirical conclusion may change depending on whether the presence of many covariates is 
taken into account when conducting inference. In contrast, the empirical findings for college 
educated individuals appear to be statistically significant and robust across all inference 
methods. 


7 Conclusion 

We established asymptotic normality of the OLS estimator of a subset of coefficients in 
high-dimensional linear regression models with many nuisance covariates, and investigated 
the properties of several popular heteroskedasticity-robust standard error estimators in this 
high-dimensional context. We showed that none of the usual formulas deliver consistent 
standard errors when the number of covariates is not a vanishing proportion of the sample 
size. We also proposed a new standard error formula that is consistent under (conditional) 
heteroskedasticity and many covariates, which is fully automatic and does not assume special, 
restrictive structure on the regressors. 

Our results concern high-dimensional models where the number of covariates is at most 
a non-vanishing fraction of the sample size. A quite recent related literature concerns 
ultra-high-dimensional models where the number of covariates is much larger than the 
sample size, but some form of (approximate) sparsity is imposed in the model; see, e.g., 

Belloni, Chernozhukov, and Hansen (2014), Farrell (2015), Belloni, Chernozhukov, Hansen, and Fernandez 
(2017), and references therein. In that setting, inference is conducted after covariate selec¬ 
tion, where the resulting number of selected covariates is at most a vanishing fraction of the 


sample size (usually much smaller). An implication of the results obtained in this paper is 
that the latter assumption cannot be dropped if post covariate selection inference is based 
on conventional standard errors. It would therefore be of interest to investigate whether the 
methods proposed herein can be applied also for inference post covariate selection in ultra- 
high-dimensional settings, which would allow for weaker forms of sparsity because more 
covariates could be selected for inference. 
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Table 1: Simulation Results (Model 1 in Supplemental Appendix). 

(a) Empirical Coverage 


Gaussian Distributional Approximation Bootstrap Distributional Approximation 




HOO 

HOI 

HCO 

HCl 

HC2 

HC3 

HC4 

HCK 

HOO 

HOI 

HCO 

HCl 

HC2 

HC3 

HC4 

HCK 

Homoskedastic Model 
K/n = 0.001 

0.949 

0.950 

0.948 

0.948 

0.948 

0.948 

0.948 

0.948 

0.946 

0.946 

0.943 

0.943 

0.943 

0.943 

0.943 

0.943 

K/n = 

0.101 

0.939 

0.956 

0.939 

0.952 

0.952 

0.962 

0.980 

0.951 

0.951 

0.951 

0.947 

0.947 

0.948 

0.949 

0.947 

0.942 

K jn = 

0.201 

0.916 

0.947 

0.919 

0.947 

0.946 

0.968 

0.989 

0.945 

0.965 

0.965 

0.950 

0.950 

0.949 

0.946 

0.944 

0.939 

K/n = 

0.301 

0.900 

0.950 

0.904 

0.954 

0.951 

0.977 

0.983 

0.949 

0.980 

0.980 

0.961 

0.961 

0.949 

0.931 

0.948 

0.933 

K/n = 

0.401 

0.881 

0.954 

0.884 

0.955 

0.952 

0.989 

0.972 

0.949 

0.989 

0.989 

0.976 

0.976 

0.956 

0.928 

0.967 

0.944 

Heteroskedastic Model 
K/n = 0.001 

0.880 

0.880 

0.945 

0.945 

0.945 

0.945 

0.946 

0.945 

0.939 

0.939 

0.937 

0.937 

0.937 

0.937 

0.937 

0.937 

K/n = 

0.101 

0.725 

0.750 

0.885 

0.904 

0.926 

0.957 

0.989 

0.948 

0.897 

0.897 

0.916 

0.906 

0.907 

0.909 

0.902 

0.919 

K/n = 

0.201 

0.762 

0.804 

0.853 

0.901 

0.924 

0.973 

0.995 

0.945 

0.919 

0.919 

0.919 

0.909 

0.908 

0.907 

0.908 

0.920 

K/n = 

0.301 

0.784 

0.856 

0.837 

0.903 

0.926 

0.981 

0.977 

0.947 

0.944 

0.944 

0.936 

0.926 

0.919 

0.903 

0.920 

0.920 

K/n = 

0.401 

0.758 

0.875 

0.792 

0.908 

0.929 

0.990 

0.950 

0.948 

0.975 

0.975 

0.962 

0.962 

0.936 

0.900 

0.953 

0.926 


00 (b) Interval Length 


Gaussian Distributional Approximation Bootstrap Distributional Approximation 




HOO 

HOI 

HCO 

HCl 

HC2 

HC3 

HC4 

HCK 

HOO 

HOI 

HCO 

HCl 

HC2 

HC3 

HC4 

HCK 

Homoskedastic Model 
K/n = 0.001 

0.148 

0.148 

0.148 

0.148 

0.148 

0.148 

0.148 

0.148 

0.148 

0.148 

0.149 

0.149 

0.149 

0.149 

0.149 

0.149 

K/n = 

0.101 

0.148 

0.156 

0.148 

0.157 

0.156 

0.165 

0.186 

0.156 

0.161 

0.161 

0.158 

0.158 

0.158 

0.158 

0.158 

0.157 

K/n = 

0.201 

0.148 

0.166 

0.149 

0.167 

0.165 

0.185 

0.225 

0.165 

0.180 

0.180 

0.170 

0.170 

0.169 

0.167 

0.166 

0.164 

K/n = 

0.301 

0.148 

0.177 

0.150 

0.179 

0.177 

0.212 

0.219 

0.177 

0.210 

0.210 

0.189 

0.189 

0.182 

0.172 

0.180 

0.174 

K/n = 

0.401 

0.148 

0.192 

0.150 

0.194 

0.191 

0.247 

0.213 

0.190 

0.260 

0.260 

0.223 

0.223 

0.200 

0.174 

0.212 

0.189 

Heteroskedastic Model 
K/n = 0.001 

0.148 

0.148 

0.186 

0.186 

0.186 

0.186 

0.187 

0.186 

0.186 

0.186 

0.188 

0.188 

0.188 

0.188 

0.188 

0.188 

K/n = 

0.101 

0.148 

0.156 

0.213 

0.225 

0.241 

0.273 

0.357 

0.254 

0.243 

0.243 

0.264 

0.264 

0.266 

0.268 

0.273 

0.269 

K/n = 

0.201 

0.148 

0.166 

0.187 

0.209 

0.226 

0.276 

0.353 

0.244 

0.243 

0.243 

0.252 

0.252 

0.251 

0.248 

0.251 

0.249 

K/n = 

0.301 

0.148 

0.177 

0.170 

0.203 

0.219 

0.287 

0.278 

0.240 

0.259 

0.259 

0.254 

0.254 

0.244 

0.232 

0.247 

0.239 

K/n = 

0.401 

0.148 

0.191 

0.159 

0.206 

0.220 

0.310 

0.239 

0.241 

0.300 

0.300 

0.276 

0.276 

0.248 

0.218 

0.269 

0.243 


Notes: (i) DGP is Model 1 from the supplemental appendix, sample size is n = 700, number of bootstrap replications is B = 500, and number of simulation replications is 
S = 5,000; (ii) Columns HOO and HOI correspond to confidence intervals using homoskedasticity consistent standard errors without and with degrees of freedom correction, 
respectively, columns HC0-HC4 correspond to confidence intervals using the heteroskedasticity consistent standard errors discussed in Sections 2 and 4.3, and columns HCA" 
correspond to confidence intervals using our proposed standard errors estimator. 
















Table 2: Empirical Application (Returns to Ability, AFQT Score). 


(a) Secondary Education 



Outcome: 

log (wages) 

/S 

0.060 



Std.Err. 

p-value 

HOO 

0.023 

0.010 

HOI 

0.028 

0.029 

HCO 

0.026 

0.020 

HCl 

0.030 

0.048 

HC2 

0.030 

0.048 

HC3 

0.036 

0.092 

HC4 

0.039 

0.122 

ECK 

0.032 

0.058 

Kn 

122 


n 

436 


Kn/n 

0.280 


Mn 

0.422 



(b) College Education 



Outcome: 

log(wages) 

/3 

0.091 



Std.Err. 

p-value 

HOO 

0.032 

0.005 

HOI 

0.038 

0.016 

HCO 

0.033 

0.006 

HCl 

0.039 

0.018 

HC2 

0.038 

0.016 

HC3 

0.044 

0.037 

HC4 

0.048 

0.058 

hca: 

0.038 

0.017 

Kn 

123 


n 

452 


Kn/n 

0.272 


Mn 

0.411 
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